The following report presents the data preparation process for the Remote Working Survey, conducted in Australian state of New South Wale in August-September 2020. The resulted dataset will be further investigated in the STATA statistical package to gain significant insights from the data.
First of all, we will start the data preparation with renaming columns, as the existing ones are namely the list of questions.
There are 4 conventions assumed for te following dataset: -bp columns means answers for the question From the following, please select the most significant barrier to doing your work remotely, sp — From the following, please select the least significant barrier to doing your work remotely, ba — Compare remote working to working at your employer’s workplace. Select the best aspect of remote working for you, wa — Compare remote working to working at your employer’s workplace. Select the worst aspect of remote working for you
From now on, we could continue the initial data investigation process.
Let’s have a look at Age column, which will be the difference between YearBorn column and 2020
We notice that there are 4 significant outliers in the Age distribution. For now, we will keep 2 respondents with Age around 80 and drop 2 definitely wrong answers.
Going further, we will look at the gender distribution in the dataset
As we could see, there are 200 more men in the dataset. Moreover, we notice 2 NA’s, which will be thrown away as their amount is too insignificant to keep.
Next, we will look at the Industry’s and IndustryAdv’s distributions.
IndustryAdv column seems to be too diversified to be used in a sample of 1500 respondents. Even the Industry column looks like too rare for some categories. Therefore, we will recode the Industry column to a new one called Sector, which will be divided into 4 categories: primary and other (farming, mining, etc.), secondary (manufacturing, etc.), tertiary quinary (public services, healthcare, etc.) and tertuary quaternary (intellectual professionals).
Current dataset contains responses from people working in the tertiary sector mainly, which seems quite obvious as it is hard to imagine a farmer working 100% remotely. Therefore, it may be unnecessary to use this variable as a predictor.
We will proceed the analysis with investigation of position-related columns: Occupation, OccupationAdv, IsManager
## Warning: Ignoring 135 observations
Unfortunately, as IsManager column contained missing observations, it won’t be used as regressor. Furthermore, the OccupationAdv column will be dropped, just as the IndustryAdv. Finally, all the minor categories for the Occupation variable will be renamed to ‘Other’
Next, we will investigate the CompanySize, HouseHold, Experience, Transportation columns
Transportation and Experience columns will be kept as they are. In case of CompanySize, though, it will be restricted to 3 categories: 1-19, 20-199 and 200+. As for the HouseHold variable, it will contain the couple+children, couple, single and other levels.
Next, we will explore variables related to actual time being spent remote and the preferred amount of remote work in hours. The AfterCovid variable will be used to refer to a variable expressing preferation, as the further analysis will concentrate on the after-covid times.
As we are mostly interested in the remote and hybrid work, we will throw away respondents which spend (almost) all their working hours onsite. Moreover, we will assume that people working remotely in less than in 20% cases will be classified as ‘almost onsite’, from 20 to 90 - working ‘hybrid’, and those working 100% will be named ‘remote’ workers.
Regarding the CompanyEncourageToRemote and CompanyPreparedForRemote, just the same as it was with the TimePrefferedAfterCovid, they will be dropped as we are interested in the long-twerm preferencies. As for the EasyCollaborateRemote column, it will not be used throughout the research, as it is correlated with the sp and pb columns which will be discussed later. As for the RecommendRemote and ChooseRemote columns, they will be restricted to yes|no|hard to say categories to avoid small groups division later during the analysis and model creation.
Going further, as CommonToRemote and EasyToRemote are pretty strngly correlated (see code output), we will use only CommonToRemote one. The same story goes with CompanyEncourageToRemoteAfterCovid and CompanySupportRemoteAfterCovid as it is naturally hard to imagine encouraging remote work without suppoting it.
df_temp1 <- df %>% filter(EasyRemote!="hard to say", CommonRemote!="hard to say")
df_temp2 <- df %>% filter(EncourageRemote!="hard to say", SupportRemote!="hard to say")
print(round(sum(df_temp1$EasyRemote != df_temp1$CommonRemote)/nrow(df_temp1), 4)*100)
## [1] 8.31
print(round(sum(df_temp2$EncourageRemote != df_temp2$SupportRemote)/nrow(df_temp2), 4)*100)
## [1] 5.13
Next we continue our analysis with hours-related variables. Firstly, we will replace all the zero variables with 0.01. This will be needed to take into account cases when the difference between onsite and remote hours spent would be noticable. Secondly, we will exclude observations with total hours spent on mentioned activities greater than 24.
Looking at the densities of mentioned variables, it seems like a log-transformation or simple value replacement is supposely to be used.
One of the hardest variables in terms of data preparation are the biggest and smallest challenged faced during the remote work. It will be calculated as follows: 1) all the unique answers for questions bp1-bp15 will be counted for each individual 2) all the unique answers for questions sp1-sp15 will be counted for each individual 3) the aforementioned tables will be joined, and the counts for each unique answers for each ID (respondent) will be subtracted 4) 0 values will be replaced with ‘no clear challenge’ 5) negative values will be replaced with ‘Did not face any big challenge’ 6) the most popular challenge,though, will be classified as the biggest challenge faced
As we could see, the resulted BiggestChallenge column is too sparse. Therefore, we will combine all the IT/equipment issues into single one called ‘Technical’. We will also unite my workspace and my living situation answers into single one. The rest will be just shortened to more human-like options.
Almost the same procedure will be applied to the biggest benefit problem with the only difference in final category’s names
As in case of BiggestChallenge column, the number of categories will be cut down. Ability to socialize and other relatioship variables will be united into single group. Commuting, on-learinng opportunities will be united into category “Professional”. Family responsibilities will be included as part of work-life balance The left 3 categories will be left unchanged.
Last but not least is the Prodcutivity variable which will be the core interest in the further research.
We should note that there are less people that claims working less productive by 40% and 50% percent. it might be also related with the strach of being suspected in working not effectively and being fired, as the productivity was measured by the respondents, not third parties. Respondents might also overestimate their productivity to make more impression, especially if we are talking about small companies (ex. 1-4 people), where it will be pretty obvious which answers belong to whom, regarding the anonimousity of such a survey. Nevertheless, this aspect will be discussed further during the very research.
The study will concentrate on the very fact if the change in productivity occured at all and in what direction. as we will not be interested in the degree of that change, we will restrict the variable to only 3 possible options.
In this way, we cleared dataset to up to 1250 observations with 27 potential regressors. The datasat will be saved to be used in STATA
df
## # A tibble: 1,250 x 27
## ID Gender Occupation CompanySize HouseHold Experience Transportation
## <int> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Female Clerical and ad~ 20-199 Couple More than~ Regional
## 2 2 Male Managers 1-19 Couple+c~ More than~ Metro
## 3 3 Male Managers 200+ Other More than~ Metro
## 4 4 Female Professionals 20-199 Couple+c~ Between 1~ Metro
## 5 6 Male Other 20-199 Single Between 1~ Metro
## 6 7 Male Clerical and ad~ 200+ Couple+c~ More than~ Metro
## 7 8 Male Managers 1-19 Couple+c~ Between 6~ Metro
## 8 10 Male Managers 200+ Other More than~ Metro
## 9 11 Female Clerical and ad~ 200+ Single Between 6~ Metro
## 10 12 Male Professionals 200+ Couple More than~ Metro
## # ... with 1,240 more rows, and 20 more variables: Remote <chr>,
## # CompanyCommonToRemote <chr>, CompanyEasyToRemote <chr>,
## # RecommendRemote <chr>, RemotePreffered <chr>,
## # CompanyEncourageToRemoteAfterCovid <chr>,
## # CompanySupportRemoteAfterCovid <chr>, ChoiceToRemoteAfterCovid <chr>,
## # Productivity <chr>, Age <dbl>, Sector <chr>, ChooseRemote <chr>,
## # CommonRemote <chr>, SupportRemote <chr>, WorkingHoursDiff <dbl>, ...
df %>% write_csv("data/remote_productivity.csv")